Parallel Computing — TiTilda Notes

Contents

Introduction

The topic of Parallel Computing explores how parallel machines work and how algorithms can be parallelized, analysing all the gains and drawbacks of such a choice.

Measuring the performance of a parallel algorithm is not straightforward: multiple metrics are involved and, depending on the context, different assumptions must be made.

Starting from the abstract representation of a parallel machine, different architectures will be presented throughout this document.

PRAM Machine and Performance Evaluation

The PRAM (Parallel Random Access Machine) is an extension of the conventional RAM machine. It is used to model a theoretical parallel machine able to run parallel algorithms.

PRAM is easy and intuitive and eliminates a great amount of synchronization issues, allowing us to focus on understanding the algorithm. Variations of the PRAM exist in the real world (e.g. CUDA-enabled GPUs) and abstract PRAM algorithms can be easily adapted to other parallel architectures.

The PRAM model has the following characteristics:

In a PRAM, all the processors execute the same instruction at the same time. For this reason, concurrent writes and/or reads to/from the same memory cell may happen.

PRAM machines can be classified with respect to their ability to perform concurrent reads/writes to memory: reads/writes can be exclusive (processors can concurrently read/write only from/to distinct memory cells) or concurrent (processors can concurrently read/write from/to any memory cell, even the same one).

While concurrent reads are trivial and are not an issue, what happens when two or more processors try to write something to the same memory cell? Three possible solutions:

  1. Priority CW: the write of the processor with the highest priority wins;
  2. Random (arbitrary) CW: one of the concurrent writes is arbitrarily chosen and wins;
  3. Common CW: concurrent writes are allowed only if all the processors write the same value.

While the priority and random CW machines are trivial to understand, common CW machines may not be, so a clarifying example is provided.

Assume you have a machine that has to evaluate the disjunction of a large number of formulae. How the formulae are read from the input is not of interest here.

The machine starts by initializing a chosen output cell (e.g. the first) with a value representing “false”. This may be done by having only one chosen processor write to the cell or by having all the processors write to the same cell at the same time (which works because the same value is written).

After that, each processor evaluates its formula and, if the formula evaluates to “true”, writes that value to the output cell, otherwise it does nothing.

This algorithm works because a disjunction of formulae is true if and only if at least one of the formulae is true. If no formula evaluates to true, the initial “false” value is never modified, while if one or more formulae evaluate to true, the corresponding processors write a value representing “true” to the cell.

This is a valid algorithm because if multiple processors write to the output cell, they are writing the same value so there are no undefined behaviors.
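The same idea maps naturally onto GPUs, where having many threads store the same value to the same address is a common idiom. Below is a minimal CUDA-style sketch (the kernel and variable names are made up for illustration; the host is assumed to initialize *out to 0 before the launch): every thread whose formula evaluated to true writes the same value 1 to the single output cell, mirroring the common CW rule.

__global__ void disjunction(const int *formula_results, int n, int *out) {
    // Each thread plays the role of one PRAM processor.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && formula_results[i]) {
        *out = 1;  // every concurrent writer stores the same value
    }
}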

There are other types of PRAM machines (mostly obtained constraining the amount of resources available): machines with a bounded shared memory, machines with a bounded number of processors, machines with a bounded word size and machines with other constraints on simultaneous memory access. Those kinds of machines will not be considered for the time being.

Let A, B \in \{\text{EREW}, \text{CREW}, \text{Priority}, \text{Common}, \text{Random}\}. We define an order relation A \le B \iff \text{any algorithm written for } A \text{ will also work unchanged on } B. If A \le B then B is computationally stronger than A but it is also less realistic.

The more powerful a PRAM is, the less realistic it is.

With respect to the aforedescribed order relation, all the PRAMs can be sorted as follows:

\text{EREW} \le \text{CREW} \le \text{Common} \le \text{Random} \le \text{Priority}

As said before, measuring the performance of an algorithm running on a PRAM machine is not trivial: the existence of multiple processors running in parallel makes everything harder. Moreover, not all processors may be doing useful work at the same time (this is not in contrast with the fact that each processor runs the same instruction at the same time, as will be explained later in this document), so only a fraction of them is working while the others are just idling.

While explaining all the performance metrics involved in the measuring process, “to solve a problem of input size n” will be used as a synonym of “running an algorithm whose input can be expressed as a function of n”.

Let T^*(n) be the time it takes to solve a problem of input size n on a single processor using the best sequential algorithm currently available, T_p(n) the time it takes to solve the same problem using p processors and T_\infty(n) the time it takes to solve the same problem with any number of processors (read that as “as many processors that can be used”).

The speedup measures how much faster the algorithm runs using p processors and it is calculated as

SU_p(n) = \frac{T^*(n)}{T_p(n)}

The efficiency measures how well each single processor is being used and it is calculated as

E_p(n) = \frac{T_1(n)}{p T_p(n)}

An efficiency close to 1 means that the processors are almost never idling while an efficiency close to 0 means that the processors are almost always idling.

For a finite-size problem, it is useless to make p greater than the SU: at some point, there will not be enough data to keep all the processors busy and the efficiency will decrease.

If T^* \simeq T_1 (that is a realistic assumption) then

E_p(n) = \frac{T_1(n)}{p T_p(n)} \simeq \frac{T^*(n)}{p T_p(n)} = \frac{SU_p(n)}{p}
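As a purely illustrative example (the numbers are made up): if the best sequential algorithm takes T^*(n) = 100 time units and the parallel one takes T_8(n) = 20 time units on p = 8 processors, then

SU_8(n) = \frac{100}{20} = 5 \qquad E_8(n) \simeq \frac{SU_8(n)}{8} = 0.625

that is, each processor is doing useful work only about 62% of the time.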

Let C(n) be the cost (in a currency of choice, AWS credits or any similar unit) of using a PRAM to solve an n-sized problem, P(n) the number of processors used and T(n) the total time for which the processors were used, then C(n) = P(n) \cdot T(n).

Let W(n) be the total work (i.e. the total number of operations) performed by a PRAM machine.

As processors may be idling, it is true that W \le C.

The energy consumption of the PRAM machine is proportional to \frac{W}{T_p}.

We mentioned before that we may apply constraints to the unbounded PRAM machine: in this way we get a more realistic version that can be used to analyze its practical applications in more detail (as an example, CUDA-enabled GPUs are really similar to a constrained PRAM machine).

Two lemmas are given to simplify the computation of performance metrics in the case of specific constraints applied to PRAM machines.

Any problem that can be solved in T steps using P processors can also be solved using P' \lt P processors in O(TP/P') steps.

This can be obtained by assigning multiple tasks (where a task is what is done by a single processor of the bigger PRAM) to a single processor of the smaller PRAM.
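For instance (numbers purely illustrative): a problem solved in T = 10 steps on P = 1000 processors can be solved on P' = 100 processors by having each remaining processor simulate 10 of the original ones at every step, which gives

O\left(\frac{TP}{P'}\right) = O\left(\frac{10 \cdot 1000}{100}\right) = O(100) \text{ steps}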

Any problem that can be solved in T steps on a PRAM with p processors and m memory cells can be solved on a machine with \max(p, m') processors and m' memory cells (with m' \lt m) in O(Tm/m') steps.

Scaling: Amdahl’s law vs. Gustafson’s law

Amdahl’s law and Gustafson’s law are two laws used to determine how the performance of a parallel machine will scale. Both laws are more or less realistic, but they rely on different assumptions, so the context in which each law is applied is really important.

According to Amdahl’s law, any program consists of an interleaving of non-parallelizable (serial) segments with parallelizable segments.

In this case, T_p \gt \frac{T_1}{p} because the serial segments will always take the same time to complete (they run on a single processor and cannot be parallelized). This means that SU \lt p.

Let f be the fraction of the program that is parallelizable (so that 1-f is the fraction that is not parallelizable), then the speedup according to Amdahl’s law is computed as

SU(p, f) = \frac{T_1}{T_1 \cdot (1-f) + \frac{T_1 \cdot f}{p}} = \frac{1}{1 - f + \frac{f}{p}}

With an infinite number of processors, the parallelizable segments of the program will take no time to execute:

\lim_{p \to \infty} SU(p, f) = \frac{1}{1 - f}

According to the formulae, Amdahl is pessimistic: the performance increase obtained by adding new processors decreases with the number of processors. This means that there will be a point where the performance increase cannot justify the cost of a new processor.

While Amdahl assumes that the portion of the algorithm that is parallelizable is fixed, Gustafson drops this assumption.

According to Gustafson, it is possible to increase the size of the problem to make use of more processors: let s be the fraction of the time used to execute the serial part of the algorithm (so that 1 - s is the time to complete the parallelizable part of the algorithm), then

SU(p) = \frac{T_1}{T_p} = \frac{s + p \cdot (1 - s)}{s + (1 - s)} = s + p \cdot (1 - s)

Note that, according to Gustafson, the speedup depends only on the number of processors, so it is always a good idea to add processors when the size of the problem increases (linear speedup: Gustafson is optimistic).
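The difference between the two laws is easy to see numerically. The following standalone C sketch (the 95%/5% split is arbitrary) prints both speedups for a growing number of processors: Amdahl’s value saturates towards 1/(1-f) = 20, while Gustafson’s keeps growing linearly.

/* Illustrative comparison of the two laws (hypothetical helper, not from the
 * original notes). */
#include <stdio.h>

/* Amdahl: f is the parallelizable fraction of the (fixed-size) program. */
static double amdahl(double f, double p) { return 1.0 / ((1.0 - f) + f / p); }

/* Gustafson: s is the serial fraction of the (scaled) parallel run. */
static double gustafson(double s, double p) { return s + p * (1.0 - s); }

int main(void) {
    double f = 0.95;              /* 95% parallelizable, 5% serial */
    for (int p = 1; p <= 1024; p *= 4)
        printf("p=%4d  Amdahl=%6.2f  Gustafson=%7.2f\n",
               p, amdahl(f, p), gustafson(1.0 - f, p));
    return 0;
}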

Strong scaling is the ability of a system to improve the performance of a program by increasing the number of processors while keeping the problem size constant.

This type of scaling is often analyzed using Amdahl’s law.

Weak scaling is the ability of a system to maintain its efficiency when the number of processors is changed proportionally to the workload size.

This type of scaling is often analyzed using Gustafson’s law.

Gustafson’s law assumptions better fit reality because, currently, there is an almost infinite amount of data that can be processed in countless ways just to produce and extract even more data. As computing power is relatively cheap nowadays, it is almost always possible to add more processors to get faster results.

Computer architectures

A computer program is just a list of elementary instructions that are executed by the processor. A single instruction may take more than one clock cycle to be completely executed (especially common in CISC processors) or, in some processors, multiple instructions are executed at the same time (e.g. in Sony PlayStation 2 EmotionEngine’s Vector Units or in XDNA NPUs’ Compute Tiles). It may even happen that different parts of the CPU handle different phases of the execution of different instructions at the same time (pipelining). Instructions (and data) are stored in memory.

As we will see in the following sections, the three key points to make an application that efficiently uses all the resources provided are:

  1. have enough data to feed into the parallel processors;
  2. group similar elaborations with SIMD;
  3. create more threads than the ones that could be supported by the hardware to mask latencies.

Processor architecture

Nowadays, CPUs are really complex machines: you can have a quick look at the evolution and the complete specification of almost every Intel CPU from 1978 up to the present day in this handy 5198-page-long manual.

If you are interested, a good amount of documentation about the aforementioned Compute Tiles is available here (although at the time of writing I could not find the instruction set reference).

In the following paragraphs we will discuss the techniques used by hardware designers to get the most out of the silicon. While the techniques are presented separately, it is really common to find a combination of several of them in real processors.

A processor is generally composed of three functional blocks: a Fetch/Decode Unit (FDU) that fetches and decodes instructions, an ALU that executes them, and an Execution Context holding the registers and state of the running instruction stream.

Let’s consider the simplest possible single-cycle CPU.

A single-cycle CPU is the simplest kind of processor and it executes a single instruction per clock cycle (one clock cycle is the same as one machine cycle).

Single-cycle CPUs are usually used for educational purposes only: they are less complex than other types of CPU but they are also slower (the clock speed must be lowered to give the slowest path, i.e. the critical path, enough slack time). In this section we do not care about splitting instructions into simpler steps for performance, we only consider parallelization.

In practice, single-cycle designs quickly evolve into pipelined ones, because it is almost impossible to read from memory, perform a computation step and then write back to memory all in a single short cycle.

The Program Counter (PC) register contains the address in memory of the next instruction to execute.

How can the performance of the simple CPU be improved? Hardware designers can choose between plenty of different parallelization strategies.

Superscalar processors

If an instruction A uses the result of another instruction B, then A is dependent on B (this is called a data dependency), so the processor cannot execute A before or while executing B. This dependency relation is transitive.

It may happen that there is no dependency between two (or more) instructions, so that they can be executed simultaneously: this is exactly what superscalar processors do.

Superscalar processors are composed of multiple FDUs, multiple ALUs (which all work on the same execution context) and an out-of-order control logic that finds independent instructions and feeds them into the different FDUs.

It is possible that there is a long chain of dependent instructions: in that case, only one FDU and one ALU are occupied while the others are left waiting.
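A tiny, purely illustrative C function makes the idea concrete: the first two statements have no dependency between them, so a superscalar core can issue them in the same cycle, while the last one must wait for both.

int independent_example(int x, int y, int z, int w) {
    int a = x * y;   // no dependency on b: can be issued together with the next line
    int b = z + w;   // no dependency on a
    return a - b;    // data-dependent on both a and b: must wait for their results
}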

Multicore processors

Plain superscalar processors used lots of silicon to make a single instruction stream run fast: big caches, out-of-order execution units, branch predictors and pre-fetchers would occupy a large die area compared to the actual processing power.

What if, instead of using silicon to make smarter processors, it was used to make more processors, maybe on a single die? Meet Multicore processors.

Multicore processors are, as the name implies, composed of multiple independent processor cores on a single die. Each core may be simpler and slower than a full-featured superscalar processor, but there is a big potential for speedup and the entire chip is generally easier to design and manufacture.

Of course, if one does not exploit the existence of multiple cores, silicon can be considered wasted and performance will be pretty bad.

Multithreading

To exploit multicore processors, the programmer must create multiple threads that will run in parallel. But what happens if the programmer creates more threads than the ones available in hardware? Multicore systems often come with a scheduling system that manages the execution of the threads, possibly interleaving them if the available hardware threads are not sufficient.

The process of pausing one thread, saving its state and loading another thread is called context switch and it does consume time. To save some of the time used to switch between different threads, multiple Execution Contexts (hardware threads) are placed on a single core, reducing the need to access memory to save the state of a thread.

If the hardware threads are not sufficient, context switches will still need to access memory.

Multithreading also helps masking the issues caused by the memory latency as described in this section.

SIMD Processors

SIMD is an acronym that stands for Single Instruction, Multiple Data. SIMD processors are able to apply the same transformation to entire vectors of data. For this reason, SIMD instructions are also called vector instructions.

In the silicon, SIMD instructions are implemented by placing multiple ALUs in the same core.

The number of lanes describes how many elements can be processed at the same time: as an example, to perform the elementwise sum of two vectors of 16 int32_t values, you need 16 32-bit lanes.

Depending on the architecture, ALUs can be combined to process bigger data: instead of the 16 32-bit lanes from the example above, it may be possible to use the same ALUs to compute the elementwise sum of two vectors of 32 int16_t values (32 lanes) or of two vectors of 8 int64_t values (8 lanes).

Intel’s take on SIMD is the AVX (Advanced Vector Extensions) family of instructions.
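As a minimal sketch of the lane concept (assuming an AVX2-capable CPU and compilation with -mavx2; function and array names are made up), the following C function adds two int32_t vectors 8 elements at a time, since a 256-bit AVX2 register offers 8 32-bit lanes:

#include <immintrin.h>
#include <stdint.h>

void add_vectors(const int32_t *a, const int32_t *b, int32_t *c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {                       /* vectorized part */
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(c + i), _mm256_add_epi32(va, vb));
    }
    for (; i < n; i++)                                  /* scalar remainder */
        c[i] = a[i] + b[i];
}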

Conditional execution

It is possible that not all the elements in a vector need to be processed in the same way: branching (i.e. if-else) is used. In this case, the condition is evaluated for each element in the vector and a mask is produced containing the results of the comparisons.

After the mask has been produced, all the branches are executed and, according to the mask, each result can be saved or discarded.

This means that some of the results may not be used, so part of the ALU work is wasted. This can decrease efficiency by a lot.
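A hedged AVX2 sketch of this mask-and-select mechanism (illustrative names, processes exactly 8 elements): both branches are computed for every lane, then the comparison mask decides, lane by lane, which of the two results is kept. The work spent on the discarded lane is exactly the waste described above.

#include <immintrin.h>
#include <stdint.h>

/* out[i] = x[i] > 0 ? x[i] * 2 : -x[i], eight elements at a time. */
void branchy(const int32_t *x, int32_t *out) {
    __m256i v     = _mm256_loadu_si256((const __m256i *)x);
    __m256i mask  = _mm256_cmpgt_epi32(v, _mm256_setzero_si256()); /* x > 0 */
    __m256i then_ = _mm256_slli_epi32(v, 1);                       /* x * 2  */
    __m256i else_ = _mm256_sub_epi32(_mm256_setzero_si256(), v);   /* -x     */
    /* blendv picks from then_ where the mask is set, from else_ elsewhere. */
    __m256i res   = _mm256_blendv_epi8(else_, then_, mask);
    _mm256_storeu_si256((__m256i *)out, res);
}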

Coherency is the property of a program where the same instruction sequence applies to all the elements to be processed (i.e. all the elements are processed by the same code path). Coherency is necessary only if all the threads must execute the same instruction at the same time (like in CUDA).

Memory access

We have quickly discussed how parallelism can be achieved in a single processor, but instructions and data are stored in memory: the most powerful processor cannot do anything if the memory is slow.

The memory is just an array of bytes indexed with incremental addresses.

Memories require time to read/write a value: that time is called memory access latency. Here you can find the datasheet (which also includes timing information) for a very old memory chip: it is simple enough to be understood without any special knowledge.

High memory latencies force the processor to remain idle (stall) while waiting for the requested data/instructions. Hardware designers adopt multiple strategies (or a combination of them) to minimize this problem: caching, prefetching and hardware multithreading.

While those strategies are extensively used in basically any system, there also exist some exotic architectures designed explicitly not to use any of them: XDNA NPUs are an example.

In XDNA NPUs, each Compute Tile has plenty of vector registers and a dedicated L1 memory containing instructions and data. Each tile can read/write directly from/to the neighbouring tiles’ memories. Compute Tiles are grouped in columns and each column also has a big Memory Tile (L2 memory). Communication between tiles is scheduled at compile time and synchronized using hardware locks. There is no caching or prefetching: data is always where it is expected to be and execution time is always predictable (no cache misses).

Multithreading to mask latencies

A multithreaded processor can hide memory latencies by performing arithmetic from other threads while one thread is waiting for memory. Programs that perform more arithmetic operations per memory access need fewer threads to completely mask stalls due to memory latency.

As said before, to implement hardware support for multithreading, multiple Execution Contexts must be placed in the processor. There are two main types of multithreading that can be implemented in hardware: Interleaved multithreading and Simultaneous multithreading.

Interleaved multithreading

With Interleaved multithreading, at the beginning of each time interval, the core decides what thread to run, giving it access to all the ALUs.

Interleaved multithreading can be coarse (context switches happen only on long stalls, like memory stalls) or fine (a context switch may happen up to every clock cycle, for maximum efficiency).

Simultaneous multithreading (SMT)

With Simultaneous multithreading, at the beginning of each time interval, the core assigns the ALUs to instructions from different threads (e.g. Intel Hyper threading).

The first Intel CPU to support hyper-threading was the Pentium 4 HT (where HT stands exactly for Hyper-Threading). The complete datasheet for the Pentium 4 HT can be found here. For more information on the HT technology, please refer to the same handy manual as before. Photo by author.

With SMT, threads run when there are resources assigned to them.

Memory bandwidth

The term Memory Bandwidth refers to the rate at which the memory can feed data into the processor. It is measured in transfer units per unit of time. Another important parameter is the latency, i.e. the time spent by the CPU waiting for an answer from the memory.

If a given computation is simple enough (or if the processor is powerful enough or if the memory is slow enough), the computation is called Bandwidth Limited. This means that the memory is not able to feed data into the processor fast enough to keep it occupied all the time.

To overcome this problem, the programmer must write the parallel program in such a way that memory access is infrequent and well planned. With powerful processors, arithmetic computation is almost free compared to memory access, so recomputing a value every time it is needed may be (and usually is) faster than computing it once and retrieving it from memory every time.

Communication between threads

There are three main ways in which threads can communicate (a.k.a. Programming Models), each one with their advantages and disadvantages: Shared Memory, Message Passing and Data Parallel.

With shared memory, all the threads can access the same memory space: every thread can peek and poke at every memory location at any time. The threads must be synchronized using locks and atomic instructions to prevent data races.

In hardware, the shared memory model is implemented using some sort of interconnect that can assume the form of a shared bus, a crossbar or a multistage network.

With message passing, threads communicate using only send and receive primitives. Message passing is simpler than shared memory and it is widely used to connect the multiple machines in a cluster.

The data parallel model is very similar to SIMD: it consists in using a set of primitives (map, reduce, scan, shift, gather, scatter, filter, …) that apply the same operation to all the elements of a collection.

GPU Architecture

GPUs were once designed and used to actually process graphics. Graphics were (and still are) processed by shaders and undergo lots of parallelizable operations (mostly matrix-vector multiplications and pixel-wise operations). In 2004, someone at Stanford noticed that telling GPUs something along the lines of “Heyyy, those are pixels, don’t worry. Can you process those pixels for me, please?” while feeding actual data to be processed into the GPU would actually work. Three years later, in 2007, the guy down below decided that selling GPUs built specifically to be used as general purpose processors without any workaround would also be a good idea. Boy was he right!

Jen-Hsun Huang, CEO at NVIDIA. Image from Wikipedia.

In this section, we’ll analyze only CUDA compatible architectures (i.e. 99% NVIDIA GPUs, 1% brands that you’ll hear about once in the news, announced as The company that will single-handedly stop NVIDIA domination, before ending up forever in the shadow realm).

The first CUDA compatible (read that as “specifically built not only for graphics”) GPUs were the ones built following the Tesla architecture.

G210-205-A2 - One example of a Tesla processor mounted on an NVIDIA GT210. Photo by author.

CUDA compatible GPUs are built following a hierarchical structure: hardware-wise, each graphics processor is composed of multiple Streaming Multiprocessors (SMs).

One SM is composed of a Warp Selector (combined with an FDU), multiple rows of Functional Units (which are, more or less, ALUs), each row having a dedicated Load/Store Unit (LSU) and an optional Tensor Core Unit, a number of warps and an L1 memory serving both as cache and as shared memory.

A warp is a big register file containing the execution context and all the registers needed by all the threads of a group (usually 32). Registers are really scarce: if a thread needs more registers than the ones available, register spilling will happen and performance will worsen noticeably.

All threads share the same global memory (the VRAM), but accessing it is orders of magnitude slower than using caches, registers or shared memory.

A hierarchical structure is followed also thread-wise: threads are grouped into blocks (in a one-, two-, or even three-dimensional fashion). The grid contains all the blocks (also in a one-, two-, or even three-dimensional fashion). Threads in a block are split in warps (that correspond to the hardware warps).

Each thread in a warp shares the same PC register, so they all execute the same instruction at the same time (control divergence may occur and it is handled as described in this paragraph).

Each SM is capable of running one warp at a time (except for newer GPUs that have multiple sub-cores in each SM). As long as there is enough space in the execution context storage and in the shared memory, multiple blocks can be assigned to a single SM. Whenever a block completes, its resources are assigned to another block.

Each thread in the same block can also access the same shared memory (placed in the L1 memory). Threads in the same block can wait for each other using the __syncthreads() primitive: it makes a thread stay idle until all the threads in the block have reached their next __syncthreads(), and it is usually used after writes to shared memory to make sure that all the threads have written their result before it can be read.

To achieve a “wait for all blocks” effect, the kernel must be split into multiple ones (that will be called in sequence): the cuts have to be placed wherever synchronization between blocks is required (synchronization happens using writes to the global memory).
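A hedged sketch tying these pieces together (the block size is fixed to 256 threads and the names are illustrative): each block reduces its chunk of the input in shared memory, using __syncthreads() so that every partial result is written before it is read. Summing the per-block results then requires a second kernel (or an atomic, see below), which is exactly the “split the kernel” situation just described.

__global__ void block_sum(const float *in, float *block_results, int n) {
    __shared__ float partial[256];          // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;  // load one element per thread
    __syncthreads();                        // all loads done before reducing

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                    // wait for the whole level
    }

    if (tid == 0)
        block_results[blockIdx.x] = partial[0];  // one partial sum per block
}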

Atomic operations may be used both on shared and global memory.

Blocks can be scheduled to run in no particular order; the same is true for the warps in a given block.

The skeleton of a CUDA application consists of host code and device code. The former one usually allocates memory on the GPU, writes data on it, calls the kernels and retrieves the results, while the latter contains the kernels and other functions called by them. Both host and device code may live peacefully in the same file.

GPU kernels are declared using the __global__ attribute while functions called only by the kernels are declared using __device__.

The programmer may choose how many threads and blocks to spawn in the GPU. The geometry of the grid should reflect the geometry of the problem to solve.

We know from before that blocks may be laid out in the grid in up to three dimensions. The same is true for threads in a block.

The best way to use this multi-dimensionality is to instantiate threads and blocks to “cover” the input data.

Take as an example the problem to process an image. A good way to exploit the multi-dimensionality of a grid is to make each thread process a single pixel.

The blocks should be placed bi-dimensionally into the grid so that a single block will process a chunk of the image.

Threads should be placed bi-dimensionally into each block so that each thread will compute a single pixel of the chunk covered by the block.

Each thread knows its offset inside the block, the offset of its block inside the grid and the dimensions of the block (thus making it possible to compute its offset inside the grid). This information is contained in the threadIdx, blockDim and blockIdx structs.

A barebone CUDA kernel looks like this:

__global__ void kernel(/* parameters */) {
    __shared__ int shared_memory[...];  // Declare shared memory area
    int thread_id = threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.y * blockDim.x;  // Thread ID inside a block

    // Out of bounds checks are omitted here

    // Read data from global memory

    shared_memory[...] = ...;  // Writes to shared memory
    
    __syncthreads();

    // Perform some other computation

    // Write results to global memory
}

and it is launched like this

// Malloc and write data to GPU global memory

dim3 blocks(...);
dim3 threadsPerBlock(...);
kernel<<<blocks, threadsPerBlock>>>(/* parameters */);

// Read results from GPU global memory

It is useless to write here all the peculiarities, intrinsics and things one can do with CUDA, so only the most notable ones will be considered. For a more complete CUDA explanation, please refer to CUDA by Example.

GPU memory

Streaming multiprocessors are really fast. So fast that the GPU memory (the VRAM) cannot keep up. This is the reason why caches and other types of memories were introduced.

Since the memory is the main bottleneck, using the correct memory access patterns is a must, otherwise the GPU will be idle for the majority of the time.

Each thread has its registers and shares a region called shared memory (creative naming, I know) with all the other threads in the same block. All the threads in all blocks may read from both the global memory and the constant memory and also write to the first one.

The shared memory is stored in the L1 cache that also stores the instructions that will be run by each thread.

Intuitively, registers are the smallest and fastest form of storage while global memory is bigger but slower.

A good CUDA program will always prefer re-computing something when it is needed instead of computing it once and reading it back from memory every time (unless, of course, the computation is so complex that it would take longer than a memory access, but this usually never happens), and it will make use of memory coalescence (covered later).

Shared memory may be used to store intermediate results in place of the global memory, if the amount of data to store is small.

To speed up memory transfers (mainly by reducing overhead), DRAM bursting is used: with this technique, a large amount of data (usually requested by multiple threads) is transferred in a single batch, thus saving on the time required to initialize each single transfer. DRAM bursting happens when memory access is coalesced.

Coalesced memory access happens when the threads in a warp access memory at an offset that can be expressed as f(...) + threadIdx.x where f does not depend on threadIdx.x.
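Two illustrative kernels (names made up) show the difference: in the first, the offset is a base plus threadIdx.x, so a warp’s 32 accesses fall into a few consecutive bursts; in the second, the stride spreads them over many bursts and the accesses are not coalesced.

__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // base + threadIdx.x
    if (i < n) out[i] = 2.0f * in[i];                // neighbouring threads, neighbouring addresses
}

__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;  // offset depends on threadIdx.x times a stride
    if (i < n) out[i] = 2.0f * in[i];                // not coalesced: one burst per thread
}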

Shared cache architecture

When multiple threads access caches that may or may not be shared and that are backed by a bigger and slower memory, it is important to ensure coherence and consistency between all the threads so that no undefined behaviour happens.

Memory coherence ensures that a system that reads and writes through caches behaves just the same as if there were no caches. When multiple accesses happen at the same time, the result of said accesses must be coherent between all the processors (i.e. if multiple processors read at the same time they will read the same value, and if multiple processors write at the same time all the processors must agree on what will actually be written).

Coherence is required because, using caches, data is duplicated, so there has to be a way to synchronize changes.

Memory consistency defines when the memory writes of one processor will be propagated to all the other processors.

Basically, memory consistency defines what a compiler can or cannot do and what precautions the programmer has to take in order to produce functioning programs. It also depends on the hardware configuration.

Both requirements are not trivial to satisfy and they heavily affect the life of the programmer (who must think about those problems and adapt their algorithms) and the performance (instruction reordering may break things and memory latencies are harder to mask).

In practice, in a multithreaded application, all the accesses to memory must happen as if there were a single processor doing all the work sequentially. Given two variables x and y, in case of reads and writes involving both variables, it should always be possible to determine which operation happens before the other.

There are four different types of memory operation orderings: W_x \to R_y, R_x \to R_y, R_x \to W_y and W_x \to W_y, where A_x \to B_y means that operation A on location x must appear to complete before a later operation B on location y in program order.

A sequentially consistent memory system maintains all four of the aforementioned memory operation orderings. In practice, a sequentially consistent system will choose the next instruction to run from one of the threads at random. Threads will still need manual synchronization so that only acceptable code paths are followed.

Sequential consistency is very time consuming: is there a way to relax the ordering requirements to allow parallel accesses and save time? We can place a write buffer between the processor and the memory, violating the W_x \to R_y rule, to obtain the Total Store Ordering (TSO) memory system.

In general, we can relax memory operation orderings as long as the computation can be kept consistent:

With a TSO memory system, a processor can read a variable before the value it has written to another variable has been propagated to all the other processors (it can move its own reads before its own writes).

With TSO, writes are not reordered.

Consistency is not related to the presence of caches: even with multiple processors accessing the same shared memory, it must be guaranteed.

We can also allow write operations to be reordered (relaxing both W_x \to R_y and W_x \to W_y) to obtain Partial Store Ordering (PSO).

We can also allow all types of memory access operations to be reordered to get the fastest speed (weak ordering and release consistency). The more we relax, the more we need to use synchronization primitives (otherwise accesses to the same memory location may be non-deterministic, causing funny and easy-to-debug problems like data races).

Synchronization primitives

Synchronization primitives are used to synchronize memory accesses from multiple sources. Programmers may use locks, semaphores and more.

A possible strategy to implement simple locks are atomic test-and-set instructions. Basically, they atomically perform a load from memory to a register and, if certain conditions are met, the value stored in memory is changed. In this way, it is possible to query the status of the lock, waiting for it to be released safely.

A good lock system has the following characteristics: low latency when the lock is uncontended, low interconnect traffic, scalability to many processors and fairness between the contenders.

Test-and-set locks aren’t ideal: they generate a lot of traffic, they are not easily scalable and there is no guarantee of fairness.

Test-and-test-and-set locks are a different approach: the processor keeps reading the status of the lock until it is released. At that point, it tries to acquire the lock and, if it fails, it goes back to waiting.

This type of lock has a slightly higher latency but generates less traffic (the waiting is done with plain reads instead of continuous atomic writes) and is more scalable.
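A minimal sketch of the two flavours using C11 atomics (function names are made up; a real implementation would also worry about memory orderings and back-off):

#include <stdatomic.h>
#include <stdbool.h>

/* Test-and-set: every attempt is a (cache-invalidating) atomic write. */
void tas_lock(atomic_bool *lock) {
    while (atomic_exchange(lock, true))   /* returns the previous value */
        ;                                 /* spin */
}

/* Test-and-test-and-set: spin on plain reads, write only when the lock
 * looks free, which generates far less coherence traffic. */
void ttas_lock(atomic_bool *lock) {
    for (;;) {
        while (atomic_load(lock))         /* wait until it looks free */
            ;
        if (!atomic_exchange(lock, true)) /* try to actually grab it */
            return;
    }
}

void unlock(atomic_bool *lock) { atomic_store(lock, false); }

Note how ttas_lock spins on a plain atomic_load, which can be served from the local cache, and only falls back to the read-modify-write atomic_exchange when the lock appears to be free.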

Parallel programming

Note: a great part of this section is in common with Advanced Algorithms and Parallel Programming.

There are three types of parallelism that can be implemented:

With task-level parallelism, the program is split into different tasks. Based on the dependency graph between tasks, multiple independent tasks can be run at the same time.

Tasks can be pipelined to save time.

Tasks can communicate either over shared memory or with message passing (each task has its own private memory and communicates with other tasks by sending and receiving messages of data).

In order to tell the compiler that parallelism can be extracted from our source code, we must give it precise instructions and hints. Take as an example a function that processes an array elementwise and writes the result to another array. Since the compiler cannot guarantee that the two arrays do not overlap, it will not parallelize anything. If instead we mark the arrays as restrict, the compiler will be more prone to use SIMD instructions or similar optimizations (in this case, if we do not respect the restrictedness constraint, undefined behaviour may happen).
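A minimal sketch of the situation just described (function and parameter names are made up): the restrict qualifiers promise the compiler that dst and src never overlap, so the loop can be safely vectorized.

void scale(float *restrict dst, const float *restrict src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];   /* safe to vectorize: no aliasing allowed */
}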

Also, the optimizations that you (or the compiler) introduce must be suitable for the architecture you are programming for: unsuitable parallelization techniques will either cause undefined behaviour or introduce useless overhead.

Multiple parallelization libraries and technologies may be used at the same time, as long as they do not interfere with each other: as an example, one can use CUDA to send computation on the GPU while performing parallelized computation on the CPU with OpenMP. It is even possible to mix MPI and OpenMP to parallelize on multiple cores on multiple chips.

Parallel algorithms

Designing a parallel algorithm is not trivial; oftentimes the PCAM procedure is followed:

  1. Partitioning: decompose problem in smaller tasks;
  2. Communication: choose how the tasks need to exchange information between each other;
  3. Agglomeration: group similar and independent tasks to reduce overhead;
  4. Mapping: assign resources to each agglomerate.

Partitioning can be performed with domain decomposition (split data with the least possible replication and assign each subset to a task) or functional decomposition (the problem is split in smaller subproblems).

The partitions must be such that there are at least an order of magnitude more tasks than there are processors available. Each task must be such that no useless redundant computation is performed, each task should take a comparable amount of time to complete and the number of tasks should scale with the problem size.

Communication can be local (communication only between close tasks) or global (communication between everyone), structured (information flows following a pattern like tree, grid, mesh, o-ring, etc.) or unstructured (information can flow in arbitrary patterns), static (communication partners are decided at compile time) or dynamic (communication partners are decided at run time).

Communication can also be explicit (using messages) or implicit (reading and writing on shared memory).

The synchronization between communication partners can either be synchronous (blocking and with handshake) or asynchronous (data transfer is independent from execution flow).

Threads

A single process can spawn multiple threads. Each thread can access all the resources acquired by the main process.

A thread is an abstraction: it may or may not correspond to a hardware execution unit and may or may not be executed in parallel with other threads.

Pthread

To use pthread, one must #include <pthread.h> and use -pthread when compiling.

When using pthreads, threads are created with pthread_create. A thread can terminate itself by calling return or pthread_exit. Threads can terminate other threads using pthread_cancel. If the main process is terminated, all threads are terminated as well, but if the main thread calls pthread_exit then all the other threads survive.

Threads can wait for other threads to exit calling pthread_join.

Barriers are created with pthread_barrier_init and are used through pthread_barrier_wait.

Mutexes are available: pthread_mutex_lock and pthread_mutex_unlock.
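A minimal, self-contained pthread sketch (thread count and workload are arbitrary) combining creation, a mutex and join:

#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4

static long counter = 0;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&mtx);
        counter++;                       /* protected by the mutex */
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[N_THREADS];
    for (int i = 0; i < N_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(threads[i], NULL);  /* wait for every worker */
    printf("counter = %ld\n", counter);  /* 4000 */
    return 0;
}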

OpenMP

To use OpenMP, one must #include <omp.h> and use -fopenmp when compiling.

OpenMP parallelizes stuff written in a parallel block:

#pragma omp parallel
{

}

It is possible to specify how many threads will be used in the section:

#pragma omp parallel num_threads(4)
{

}

Note: one can also use OMP_NUM_THREADS or omp_set_num_threads.

Sections

Sections represent the functional parallelism.

Sections can be defined to run independent tasks concurrently:

#pragma omp parallel sections
{
    #pragma omp section
    {
        task1();
    }

    #pragma omp section
    {
        task2();
    }
    ...
}

For

For represents data parallelism.

One can parallelize the iterations of a for-loop:

#pragma omp parallel for
for(...) {

}

Since each thread receives a block of iterations to process sequentially, it is possible to specify schedule(...) to be static (when all the iterations take more or less the same time), dynamic (when different iterations take different times) or runtime (based on OMP_SCHEDULE).

reduction is used to perform a reduction over a specified variable at the end of all the iterations (each thread keeps a private copy of the variable to reduce, then a global reduction is performed to save the result in the shared variable).
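A small sketch combining the clauses above (function and array names are made up): iterations are handed out dynamically and each thread accumulates into its private copy of sum, which OpenMP combines at the end of the loop.

#include <omp.h>

double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    #pragma omp parallel for schedule(dynamic) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}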

Synchronization

Code wrapped by #pragma omp single is executed by only one single thread, if it is instead wrapped by #pragma omp master the code will be executed specifically by the master thread.

#pragma omp critical is used to denote sections that must be executed atomically by one thread at a time. criticals may be named; in that case, critical sections with different names can run in parallel.

#pragma omp barrier is used to wait for all the threads to reach the same barrier.

#pragma omp atomic ensures that the memory accesses in the following statement are performed atomically.

Visibility

Data Scope Attribute Clauses are used to tell OpenMP how the variables are shared between threads.

Variables may be declared as private (all threads have their local non-shared copy) or shared (all threads share the same variable).

It is possible to specify default(shared) or default(private) to set the visibility of all the variables whose visibility was not specified or default(none) to force the programmer to specify visibility for all variables.

Parallel patterns

Patterns are known solutions to known problems. There exist known parallel patterns to solve known parallel problems.

The main problem in parallel programming is dependency (which arises when multiple statements need to access the same resources).

There are three types of dependency: \delta (true/flow dependency: the inputs of one statement are the outputs of a previous one), \delta^0 (output dependency: two statements write to the same outputs) and \delta^{-1} (antidependence: the outputs of one statement are the inputs of a previous one).
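A tiny illustrative C function containing all three (the statement labels S1–S3 are just for reference):

void dependencies(int b, int c, int e, int *out) {
    int a = b + c;   // S1
    int d = a * 2;   // S2: flow dependency (delta) on S1, it reads the a written by S1
    a = e + 1;       // S3: antidependence (delta^-1) with S2 (which reads a before S3
                     //     overwrites it) and output dependency (delta^0) with S1
    *out = a + d;
}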

Two statements can be executed in parallel if and only if there are no dependencies of any kind between them. Dependencies may be removed by rearranging or eliminating statements.

Parallelism can also be extracted from loops: if each iteration (or group of iterations, in case the dependency is between non-consecutive iterations) is independent from the others, the loop can be parallelized or unrolled.

If parallelism can be extracted from multiple loops, it is possible to group one iteration from all loops into a single thread.

In a nutshell, to be able to execute statements in parallel, there must not be dependencies between those statements and the order in which they would be executed serially should not matter.

Parallel patterns are used to solve specific types of issues that arise when moving from a serial implementation to a parallel one.

Some kinds of patterns can be classified as collectives: those are the patterns that operate on an entire collection of data rather than on single elements.

Reduce, scan, partition, scatter and gather are collectives.

Control patterns

Control patterns are used to control the execution flow in parallel programs.

The fork-join pattern is used to split computation in multiple flows and reconcile them at the end.

The map pattern is used to apply the same function (called elemental function) to all the elements of a sequence in parallel.

Elemental functions must be pure.

The stencil pattern is a generalization of the map pattern where the elemental function requires also to access nearby elements of the sequence.

The reduction pattern is used to reduce a sequence into a single element.

The scan pattern is used to compute every partial reduction of a sequence.

The recurrence pattern is an even-more-generalized version of map where multiple iterations of map are performed and each iteration depends on the results of the previous ones (not only the single elements but also neighbors).

Data management patterns

Data management patterns are strategies used to allocate and move data for each thread.

The pack pattern (opposite of unpack) is used to remove unused space from a sequence to reduce memory consumption.

Parallel pack works by creating a mask of 1s and 0s, where 1 means that the element is present and must be kept. After that, the mask undergoes a prefix sum pass (a scan with the addition operator) to get an array of incremental indices to feed into a scatter.
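A sequential C sketch of this recipe (names are made up, and the "keep the non-zero elements" criterion is arbitrary): each of the three stages, mask, prefix sum and scatter, is itself a collective that can be parallelized; they are written here as plain loops only to show the data flow. The exclusive prefix sum of the mask gives each kept element its position in the output.

#include <stdlib.h>

/* out must have room for n elements; returns the number of kept elements. */
int pack(const int *in, int n, int *out) {
    int *mask = malloc(n * sizeof *mask);
    int *idx  = malloc(n * sizeof *idx);

    for (int i = 0; i < n; i++)            /* 1. mask: keep the non-zero elements */
        mask[i] = (in[i] != 0);

    int running = 0;                       /* 2. exclusive prefix sum of the mask */
    for (int i = 0; i < n; i++) {
        idx[i] = running;
        running += mask[i];
    }

    for (int i = 0; i < n; i++)            /* 3. scatter the kept elements */
        if (mask[i])
            out[idx[i]] = in[i];

    free(mask);
    free(idx);
    return running;                        /* number of packed elements */
}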

Unpack works in the opposite way: given the mask, the prefix sum is computed and the input (sequentially indexed) is written into the output (indexed by the prefix sum).

split (opposite of unsplit) is a variation of pack that puts all the 1 elements on one side and all the others on the other side.

bin is just split but with more than 2 categories (there are multiple masks and prefix sums).

expand is just map but can produce an arbitrary number of output elements for each input element.

The pipeline pattern is used to move data between tasks exactly as it sounds.

The geometric decomposition pattern is used to split data into smaller subproblems based off the geometry of the input data. May be overlapping or not overlapping.

The gather pattern is used to read some data given the list of positions to read from. It is the opposite of scatter (which may result in race conditions).

shift, zip and unzip are variations of gather.

As scatter performs multiple writes, collisions may happen. In this case we must choose between a number of rules to handle them: atomic scatter chooses the final value randomly; permutation scatter gives an error in case of collision (so the output is a pure permutation of the input); merge scatter reduces the values causing a collision into a single value using a provided associative and commutative operator; priority scatter uses the priority of the element to be written (usually based on the input index from which the element comes).

Other patterns

The superscalar sequence pattern consists in writing tasks and ordering them only depending on their dependencies.

The futures pattern is just like javascript promises, nothing more.

The speculative selection pattern is just a higher-level implementation of hardware speculative execution.

The workpile pattern is just a map that, for each element of the initial collection, may produce an arbitrary number of other elements to be processed.

The search pattern is literally what is seems to be.

The segmentation pattern is map but applied to non-overlapping non-necessarily-uniform-sized partitions of the problem.

The expand pattern is a combination of pack and map.

The category reduction pattern is a combination of filter and reduce.

AoS vs SoA

AoS stands for “Array of Structures” while SoA stands for “Structure of Arrays”.

The former consists in storing all the problem’s structures in one single array, while the latter consists in having a struct with one array per attribute, each array containing all the corresponding attribute values.

AoS is good for random access (best cache usage) while SoA is good for vectorization et similia.

Both methods may need padding to be correctly aligned. Padding may be placed at the end, after each structure (for AoS) or after each array (for SoA).
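An illustrative C comparison of the two layouts for a set of 3D particles (field names are made up): with AoS each particle is contiguous, which is nice for random access to a whole particle; with SoA each attribute is contiguous, which is what vector units want when a loop touches a single attribute.

#define N 1024

/* Array of Structures: one particle per element. */
struct particle { float x, y, z, mass; };
struct particle particles_aos[N];

/* Structure of Arrays: one array per attribute. */
struct particles_soa {
    float x[N];
    float y[N];
    float z[N];
    float mass[N];
};
struct particles_soa particles;

/* A loop over a single attribute vectorizes naturally with SoA. */
void scale_masses(struct particles_soa *p, float k) {
    for (int i = 0; i < N; i++)
        p->mass[i] *= k;
}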

To be continued

Last modified:
Written by: Andrea Oggioni